AN 2002060236 MEDLINE
DN 21646738 PubMed ID: 11786783
TI The prevalence of negative studies with inadequate statistical
   power: an analysis of the plastic surgery literature.
AU Chung Kevin C; Kalliainen Loree K; Spilson Sandra V; Walters Madonna R;
   Kim Hyungjin Myra
SO PLASTIC AND RECONSTRUCTIVE SURGERY, (2002 Jan) 109 (1) 1-6;
   discussion 7-8.
   Journal code: 1306050. ISSN: 0032-1052.






The Prevalence of Negative Studies with 
Inadequate Statistical Power: An Analysis of 
the Plastic Surgery Literature 

Kevin C. Chung, M.D., M.S., Loree K. Kalliainen, M.D., Sandra V. Spilson, M.P.H.,
Madonna R. Walters, M.S., R.N., and Hyungjin Myra Kim, Sc.D.

Ann Arbor, Mich.



Studies published in the medical literature often neglect to consider
the statistical power needed to detect a meaningful difference between
study groups. Small sample sizes tend to produce negative results
because of low statistical power. Studies that cannot make conclusive
statements about their hypotheses can waste resources, deter further
research, and impede advances in clinical treatment. The current study
reviewed three of the most frequently read plastic surgery journals
from 1976 to 1996 to determine the prevalence of inadequately (<80
percent) powered clinical trials and experimental studies that found
no difference (negative studies) in the response variable of interest
between comparison groups. The statistical power of 54 negative
studies using continuous response variables was calculated to detect a
difference of 1 SD (±1 SD) in means between the comparative groups.
The power of another 57 negative studies with dichotomous response
(yes/no) variables was calculated to detect a relative change in
proportions of 25 percent and 50 percent from the experimental to the
control group. It was found that 85 percent of the studies with
continuous response variables had inadequate power to detect the
desired mean difference of ±1 SD. In studies with dichotomous response
variables, 98 percent had inadequate power to detect a desired 25
percent relative change in proportions, and 74 percent had inadequate
power to detect a desired 50 percent relative change in proportions.
These results indicate that many of the studies in the plastic surgery
literature lack adequate power to detect a moderate-to-large
difference between groups. The lack of power makes the interpretation
of the studies with negative findings inconclusive. Proper study
design dictates that investigators consider a priori the difference
between groups that is of clinical interest, and the sample size per
group that is needed to provide adequate statistical power to detect
the desired difference. (Plast. Reconstr. Surg. 109: 1, 2002.)



An experimental study, such as a clinical trial designed to
demonstrate the efficacy of a new intervention, involves generalizing
the treatment effect in the population using information obtained from
a subset or a sample. The clinically meaningful treatment effect is
the minimum effect necessary for the experimental factor under study
(drug, device, treatment, or procedure) to be relevant or important to
society.¹ Hypothesis testing requires a null hypothesis, a statement
that no difference (treatment effect) exists between the experimental
and control groups, and an alternative hypothesis, a statement that a
difference exists between the comparative groups.

Whenever a generalization is made about a 
population on the basis of information from a 
sample, it is always possible that the conclusion 
will be inaccurate simply because of sampling 
variability. The two types of such errors are 
type I, rejecting the null hypothesis and falsely
concluding that there is a treatment effect, and
type II, failing to reject the null hypothesis and
erroneously concluding that there is no treatment
effect. An example of type I error occurs
if a study investigating the efficacy of a new
surgical procedure concludes that the new procedure
has a significantly higher success rate
than the control procedure, when this is not
true at the population level. An example of
type II error occurs when a true difference in



From the University of Michigan Hand Center, Section of Plastic Surgery, Department of Surgery, the University of Michigan Medical Center;
Center for Statistical Consultation and Research, University of Michigan; and St. Joseph Mercy Hospital. Received for publication July 25, 2000;
revised March 9, 2001.

Presented at the 66th Annual Meeting of the American Society of Plastic and Reconstructive Surgeons, in San Francisco, California, on
September 24, 1997; and the Plastic Surgery Research Council, in Galveston, Texas, on February 28, 1997.



Material may be protected by copyright law (Title 17, U.S. Code) 






PLASTIC AND RECONSTRUCTIVE SURGERY, January 2002



success rates between the new and the control 
procedure at the population level is not found 
in the study sample. The quantity α (the Greek
letter alpha) represents the probability of making
a type I error, whereas the quantity β (the
Greek letter beta) represents the probability of
making a type II error.

The statistical power is the probability that a sample can detect a
treatment effect, should a true treatment effect exist in the
population. It is the value 1 - β, and it increases with increasing
sample size. Power is an important yet underused concept in published
studies. Researchers often guard against type I error by setting a low
significance (α) level, typically less than 0.05, to consider a chosen
test statistically significant. However, a review of published
research reveals a common lack of power, which corresponds to large β.
A study with high statistical power provides strong support for the
decision not to reject the null hypothesis, whereas a study with low
statistical power has little support for either the null hypothesis or
the alternative hypothesis. Proper study design requires investigators
to assume acceptable levels of α and β and to consider a study sample
size that will give adequate power to detect the treatment effect that
is clinically meaningful and most likely to occur.
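
The roles of α, β, and power can be illustrated with a small simulation. The following is a minimal sketch, not the authors' method: it assumes normally distributed outcomes with a known SD of 1 and uses a two-sided z test in place of the t test. The function name `rejection_rate`, the seed, and the number of simulated studies are hypothetical choices made only for illustration.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(true_diff, n_per_group, alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated two-group studies that reject the null hypothesis.

    Assumes normal outcomes with known SD = 1 and a two-sided z test
    (an illustrative simplification of the two-group t test).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# No true effect: the rejection rate estimates alpha (type I error rate).
type_i_rate = rejection_rate(true_diff=0.0, n_per_group=17)
# True effect of 1 SD: the rejection rate estimates power = 1 - beta.
power = rejection_rate(true_diff=1.0, n_per_group=17)
beta = 1 - power  # estimated type II error rate
print(f"type I rate ~ {type_i_rate:.3f}, power ~ {power:.3f}, beta ~ {beta:.3f}")
```

With no true difference, the rejection rate should sit near the chosen α; with a true difference, the share of studies that fail to reject is the estimate of β.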

Inadequately powered studies, that is, studies with insufficient
ability to reliably detect the clinically meaningful difference
between experimental and control groups, can yield negative results
simply because of lack of power. Furthermore, such studies waste
valuable resources, deter further research, and curtail advances in
clinical treatment, especially when the negative result is
misinterpreted as showing no treatment effect. On the other hand, a
reliable and informative null finding can safely allow resources to be
redistributed to other important areas of research. Medical journals
are considered an effective means of communication between
investigators and practitioners. When studies published in these
journals fail to resolve uncertainty about the research question, they
leave a large body of statistical information open to
misinterpretation. Despite the weaknesses of their results, these
publications may have a profound impact on medical practice. A study
of the summarization of research results found that physicians with
advanced training in research design and analysis were as likely as
other physicians to base their treatment decisions on the way in which
study results were summarized.⁷

In general, plastic surgeons lack a strong background in statistics;
errors in the use and interpretation of statistical analysis are
frequently found in the plastic surgery literature.⁸ Furthermore, a
lack of statistical awareness is widely accepted in both academic and
clinical medicine.⁹ We reviewed the plastic surgery literature to
determine whether clinical trials and experimental studies that
reported no difference between experimental and control groups were
adequately powered to make this conclusion at the population level.

Methods

We searched the MEDLINE database to identify clinical trials with
negative findings published from 1976 to 1996 in the plastic surgery
literature. The search was limited to three journals: Annals of
Plastic Surgery, British Journal of Plastic Surgery, and Plastic and
Reconstructive Surgery. Key words used in the search of human and
animal studies included "clinical" and "clinical trial(s)." Animal
studies were limited to cats, dogs, mice, rats, rabbits, and swine.

Inclusion criteria for each study included a 
statement of "no difference" between the con- 
trol and experimental groups in the abstract, 
specification of a statistical test, and use of a 
statistical test amenable to power calculations 
(e.g., a comparison of means or proportions). 
Exclusion criteria included the lack of a study 
abstract in MEDLINE, use of only univariate or 
descriptive data, and incomplete data. Thus, 
only comparative studies with negative findings 
are included in this report.

For each study, we selected a single key response variable noted in
the title or abstract to have a negative effect. If several variables
were equally important, the first one listed was analyzed. All
response variables included in this study were of only two types:
continuous or dichotomous. An example of a continuous variable is a
measure of grip strength; it can theoretically assume a value along a
continuum, within a specified range. On the other hand, dichotomous
variables can assume one value out of only two options, such as yes/no
or success/failure. As the main summary statistics, we recorded the
mean for continuous variables, and the proportion of response of
interest, such as the proportion of success, for dichotomous
variables. It is these summary statistics that are used to make
comparisons





Vol. 109, No. 1 / STUDIES WITH INADEQUATE STATISTICAL POWER






between experimental and control groups
when performing statistical hypothesis testing.

The power of a study is meaningful only when it is associated with
detecting a specific clinically meaningful treatment effect
(difference) that is most likely to occur. The power of a study
depends on the particular statistical test that is used to detect the
treatment effect and increases with increasing sample size and effect
size. Thus, to calculate the power of a study, it is necessary to
specify the desired α level, sample size, and the magnitude of the
clinically meaningful treatment effect (effect size). The desired
effect sizes were chosen for the two response variable types from what
are generally considered clinically meaningful differences (described
below). All power calculations assumed a two-sided statistical test
based on a 0.05 α level, which is conventionally the accepted level of
significance in medical studies. This α level allows for a 5 percent
chance of concluding that a treatment effect exists when, in fact, it
does not. For each study, statistical power was considered "adequate"
if, using the reported sample size, the power to detect the desired
effect size met or exceeded 80 percent. The power calculation methods
are described below for each of the response variable types. The Stata
statistical software application (Stata Corporation, College Station,
Texas) was used to calculate the power.

For dichotomous response variables, power was calculated on the basis
of using the two-sample proportion test to detect both a 25 percent
and a 50 percent relative difference in outcome proportions in the
experimental group from that of the control group. For example, in a
study that intends to show a reduction in surgical complications, the
clinically meaningful effect is a 25 percent (or 50 percent) decrease
in the proportion of complications in the experimental group relative
to the control group. On the other hand, for a study intending to show
an improvement in surgical success, the clinically meaningful effect
is a 25 percent (or 50 percent) increase in the proportion of
successful outcome in the experimental group relative to the control
group. Freiman et al. defined both the 25 percent and 50 percent
relative change in proportion as "clinically important" disparities.⁴
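
The relative-change definition above is simple arithmetic, sketched here for concreteness. The helper name `experimental_proportion` is hypothetical, and the 30 percent control proportion is borrowed from the worked example in the article's Discussion section rather than from any individual reviewed study.

```python
def experimental_proportion(control_proportion, relative_change, direction):
    """Experimental-group proportion implied by a relative change from control.

    direction is "decrease" (e.g., complications) or "increase" (e.g., success).
    """
    if direction == "decrease":
        p = control_proportion * (1 - relative_change)
    elif direction == "increase":
        p = control_proportion * (1 + relative_change)
    else:
        raise ValueError("direction must be 'decrease' or 'increase'")
    if not 0 <= p <= 1:
        raise ValueError("implied proportion falls outside [0, 1]")
    return p


for change in (0.25, 0.50):
    fewer = experimental_proportion(0.30, change, "decrease")
    more = experimental_proportion(0.30, change, "increase")
    print(f"{change:.0%} relative change from 0.30: {fewer:.3f} or {more:.3f}")
```

Note that the same relative change gives different absolute differences for different control proportions, which is why the required sample size depends on the control group rate.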

In addition to the meaningful effect size, the power calculation also
requires the population standard deviation (SD) of the response
variable. If statistical power were considered before the onset of the
study, it would be desirable to have the population SD. However,
because the population SD is generally not available, it is customary
and recommended to use the SD observed in the sample. In the case of
dichotomous response variables, the population SD depends entirely on
the outcome proportions in the population. In power calculations for
dichotomous response variables, we used a pooled SD derived from the
reported outcome proportion for the control group and the desired
outcome proportion for the experimental group.
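
A power calculation of this kind can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the Stata routine the authors used: it applies the usual normal approximation for a two-sided two-sample proportion test with equal group sizes and no continuity correction, and the function name `two_proportion_power` is hypothetical.

```python
from statistics import NormalDist


def two_proportion_power(p_control, p_experimental, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample proportion test.

    Normal approximation without continuity correction (an assumption;
    the article does not state which variant its software applied).
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_experimental) / 2
    # SD under the null hypothesis, pooled from the two proportions.
    sd_pooled = (2 * p_bar * (1 - p_bar)) ** 0.5
    # SD under the alternative, using each group's own proportion.
    sd_alt = (p_control * (1 - p_control)
              + p_experimental * (1 - p_experimental)) ** 0.5
    diff = abs(p_control - p_experimental)
    z = (diff * n_per_group ** 0.5 - z_alpha * sd_pooled) / sd_alt
    return norm.cdf(z)


# Control proportion 0.30 versus a desired experimental proportion of 0.15.
for n in (20, 50, 121):
    print(n, round(two_proportion_power(0.30, 0.15, n), 3))
```

The SDs here depend only on the two proportions, which is the point made in the text: for a dichotomous response no separate SD estimate is needed.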

For continuous response variables, power was calculated on the basis
of using the two-group t test, and we considered a difference of 1 SD
(±1 SD) between the experimental and control group means to be
clinically meaningful. This effect size is greater than ±0.8 SD, which
is generally considered a large effect size. Because the effect was
considered in the unit of SD, it was analogous to assuming that the
reported sample SD of pooled data from the control and experimental
groups would represent that of the population SD. It was assumed that
the included studies met the assumptions (interval data, random and
independent samples, normal distributions, and equal variances between
groups) required to use the two-group t test.
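
Because the effect is expressed in SD units, the power for a given per-group sample size can be approximated without knowing the raw SD. The sketch below uses the normal approximation to the two-group t test, which slightly overstates power in small samples; it is an illustration with a hypothetical function name, not the exact noncentral t calculation a statistical package would perform.

```python
from statistics import NormalDist


def two_group_power(effect_size_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    effect_size_sd is the difference in means in SD units. Uses the
    normal approximation to the t test, which slightly overstates power
    for small samples.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size_sd * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return norm.cdf(shift - z_alpha) + norm.cdf(-shift - z_alpha)


for n in (5, 10, 17, 30):
    power = two_group_power(1.0, n)
    label = "adequate" if power >= 0.80 else "inadequate"
    print(f"n per group = {n:2d}: power ~ {power:.2f} ({label})")
```

Setting the effect size to zero returns α itself, which is a quick check that the two rejection regions are handled correctly.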

Results

A total of 366 articles met the search criteria in MEDLINE. Of these
366 articles, 111 met the previously defined inclusion criteria. A
list of the 111 articles is available from the authors on request.
Table I illustrates characteristics of the reviewed studies. The
majority of articles (54 percent) were in Plastic and Reconstructive
Surgery, followed by the Annals of Plastic Surgery
TABLE I

Characteristics of the 111 Reviewed Studies*

Study Characteristics                       No.      %

Journals
  Annals of Plastic Surgery                  30     27
  British Journal of Plastic Surgery         21     19
  Plastic and Reconstructive Surgery         60     54
Study subjects
  Animal                                            80
  Human                                             20
Study response variables
  Comparison of means                        54     49
  Comparison of proportions                  57     51

* A list of the reviewed articles is available upon request.




(27 percent), and the British Journal of Plastic
Surgery (19 percent). The majority of studies
were conducted with animals (80 percent),
and response variables were almost equally divided
between continuous (49 percent reporting
means) and dichotomous (51 percent reporting
proportions) variables.

Of the 57 studies with dichotomous response variables, 98 percent (56
of 57) had inadequate power to detect a 25 percent difference in the
experimental group proportion relative to the control group proportion
(Table II). For detecting a 50 percent difference in the experimental
group proportion relative to the control group proportion, 74 percent
(42 of 57) of the studies had inadequate power. Of the 54 studies with
continuous response variables, 85 percent (46 of 54) had insufficient
power to detect a difference of ±1 SD in means between the
experimental and the control group.

Of the studies with dichotomous response variables, 67 percent (38 of
57) had less than 20 percent power to detect a 25 percent relative
difference in proportion. Similarly, 42 percent (24 of 57) of the
studies had less than 20 percent power to detect a 50 percent
difference in proportion. Therefore, close to half of the studies
examined had greater than 80 percent chance of missing a true
treatment effect of 50 percent relative difference in proportions. For
studies with continuous response variables, none had less than 20
percent power.

Discussion 

A lack of adequate power in clinical studies continues to afflict much
of the medical literature. This study evaluated statistical power of
experimental studies and clinical trials with negative findings that
were published during a 20-year interval in three plastic surgery
journals. Our results indicate that the majority of negative studies
published in the plastic surgery literature had insufficient power
(<0.80) to detect a true 25 percent, 50 percent, or ±1 SD difference
in outcome variables. Furthermore, a large proportion of the studies
with inadequate power had much less power than the minimum recommended
level of 0.80. Our findings are similar to those reported in earlier
studies. This lack of power does not allow for a reliable conclusion
of no treatment effect from these negative studies, but it does
suggest a need for more careful consideration of study design.

The purpose of hypothesis testing is to rule out the possibility that
an observed difference is attributable to chance. A type I error
occurs if the null hypothesis of no difference is rejected when there
is no true population difference, whereas a type II error occurs if
the null hypothesis is not rejected when there is a true population
difference. Power is the probability of rejecting a null hypothesis
when it is false, thereby avoiding a type II error. Power depends on
the size of the effect, type I error level (α), sample size, and the
estimated SD (for continuous variables). Reliable results can be
obtained when study participants are randomly selected from a
population of interest with respect to key characteristics important
to the study at hand. Note that it is possible that a chosen sample
can be biased and not representative of the population, but hypothesis
testing does not address this bias.

For this study, power was evaluated for moderate or large effect size.
For continuous response variables, an effect size of ±1 SD corresponds
to expecting a 1 SD increase, for example, in mean grip strength for
patients treated with a new procedure compared with patients treated
with a control procedure. Having 80 percent power to detect this
difference requires a sample size of 17 in each group; thus, our
analysis essentially revealed that 85



TABLE II

Distribution of Reviewed Studies with Insufficient (<0.80) Power

                Dichotomous Response Variables*
              25% Relative       50% Relative       Continuous Response
              Difference         Difference         Variables†
Power         No.      %         No.      %         No.      %

<0.20          38     66.7        24     42.1         0      0.0
0.20-0.39      10     17.5         9     15.8        12     22.2
0.40-0.59       5      8.8         1      1.8        15     27.8
0.60-0.79       3      5.3         8     14.0        19     35.2
>=0.80          1      1.7        15     26.3         8     14.8

* 57 studies; power to detect a 25% or 50% difference in experimental group proportion relative to control group proportion.
† 54 studies; power to detect ±1 SD difference in means.






percent of the reviewed studies with continu- 
ous response variables had a sample size of less 
than 17 in each group. For dichotomous re- 
sponse variables, a 50 percent relative differ- 
ence in complication, for example, from a con- 
trol group rate of 30 percent, corresponds to 
an experimental group complication rate of 15 
percent. A sample size of 134 per group is 
required to have 80 percent power to detect 
this difference. On the other hand, a 50 per- 
cent relative difference in success rate from the 
control group rate of 30 percent corresponds 
to an experimental group success rate of 45
percent. A sample size of 176 per group is
required to have 80 percent power to detect 
this difference. The majority, 74 percent, of 
the studies with dichotomous response vari- 
ables had inadequate power to detect a relative 
difference of 50 percent. Although each of 
these studies concluded that no difference was 
found between the two groups, the studies with 
inadequate power had greater than 20 percent 
chance of missing the true large effect, not 
allowing a reliable null conclusion. 
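
The per-group sample sizes quoted in this paragraph can be checked with standard closed-form approximations. The sketch below is not the authors' Stata calculation, and the article does not state which formulas were used: it assumes a normal approximation with a small-sample correction term for the comparison of means, and a Fleiss-style continuity correction for the comparison of proportions. The function names are hypothetical. Under these assumptions the three calls print 17, 134, and 176, matching the figures in the text.

```python
import math
from statistics import NormalDist

NORM = NormalDist()


def n_for_means(effect_size_sd, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-group t test (normal approximation
    plus a small-sample correction term of z_alpha**2 / 4)."""
    z_a = NORM.inv_cdf(1 - alpha / 2)
    z_b = NORM.inv_cdf(power)
    n = 2 * (z_a + z_b) ** 2 / effect_size_sd ** 2 + z_a ** 2 / 4
    return math.ceil(n)


def n_for_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample proportion test with a
    Fleiss-style continuity correction (an assumed choice)."""
    z_a = NORM.inv_cdf(1 - alpha / 2)
    z_b = NORM.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    # Uncorrected sample size from the normal approximation.
    n = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
         + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2 / diff ** 2
    # Continuity correction inflates n, more so for small differences.
    n_cc = n / 4 * (1 + (1 + 4 / (n * diff)) ** 0.5) ** 2
    return math.ceil(n_cc)


print(n_for_means(1.0))               # 1 SD difference in means
print(n_for_proportions(0.30, 0.15))  # 50 percent relative decrease from 0.30
print(n_for_proportions(0.30, 0.45))  # 50 percent relative increase from 0.30
```

The contrast between 17 per group for a continuous response and well over 100 per group for a dichotomous one shows how much more information a continuous measurement carries per subject.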

In the study design phase, researchers first must determine the
clinically meaningful effect size and then calculate the sample size
needed to provide adequate power to detect the desired difference
using a fixed level α test (conventionally set at 5 percent). A larger
sample size is required to detect a smaller effect size at the same
levels of power and α. Too large a sample size, however, can waste
resources by finding a statistically significant difference that may
not be clinically significant.¹⁶ A larger than necessary sample may
also be unethical and could put additional patients at risk in a
clinical study. Therefore, study design involves the delicate balance
of several statistical factors. Sample size calculation can be
difficult and can at best provide only an estimate, because the
information needed for the calculation, such as population SD, is an
estimate and is often difficult to obtain. Yet giving a priori
consideration to the power of the study allows investigators to
readily see if the proposed study would provide reliable results. If
the power is found to be inadequate to detect the desired effect, a
larger study or a multicenter study should be considered to provide an
increased sample size.

Even if proper considerations are given to 
power and sample size while designing the 
study, negative findings from comparative stud- 
ies can lack power for various reasons (e.g., an 



incorrect estimation of the control group out- 
come proportion for dichotomous response 
variables, and a larger sample SD than popula- 
tion SD for continuous response variables). 
Our findings, however, still indicate the need 
for more careful consideration of study design, 
including more cautious assessment of pilot 
data to obtain better estimates of the informa- 
tion required for sample size calculations. 

When a study finding is negative, post hoc power calculation provides
a way to assess the reliability of the negative conclusion. Often the
study finding can be negative if the observed effect size is smaller
than the expected. Post hoc power calculation, however, must be
performed using the effect size that is considered clinically
meaningful, which is not necessarily the observed effect size. If it
was felt that the expected effect size was too optimistic, then the
observed effect size may be more realistic and also clinically
meaningful. The results from the negative study can then provide pilot
data for a similar future investigation in which a more realistic
effect size can be used for the sample size consideration. Freiman et
al. noted that 20 percent of 71 negative studies recognized the need
for a larger sample size in their Discussion section.⁴

We evaluated whether there was a decreasing trend over time in the
percentage of studies with inadequate power. The studies were grouped
into 5-year intervals from 1976 to 1996, and a study was considered to
have inadequate power when it had less than 0.80 power to detect a 50
percent or ±1 SD relative treatment effect. For the period 1976 to
1980, 67 percent (4 of 6) of the studies had inadequate power, as did
91 percent (10 of 11) for 1981 to 1985, 68 percent (27 of 40) for 1986
to 1990, and 87 percent (47 of 54) for 1991 to 1996. The percentage of
negative studies with inadequate power, therefore, did not seem to
decrease over the 20-year interval in the plastic surgery literature.
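
The period-by-period counts can be cross-checked against the totals reported in the Results section. The short sketch below only restates the counts given in the text; the variable names are illustrative. The four periods account for all 111 reviewed studies, and the inadequately powered studies sum to 88, which equals the 42 dichotomous plus 46 continuous studies reported earlier.

```python
# (period, studies with inadequate power, negative studies reviewed),
# as reported in the text.
periods = [
    ("1976-1980", 4, 6),
    ("1981-1985", 10, 11),
    ("1986-1990", 27, 40),
    ("1991-1996", 47, 54),
]

for label, inadequate, total in periods:
    print(f"{label}: {inadequate}/{total} = {100 * inadequate / total:.1f}%")

total_inadequate = sum(row[1] for row in periods)
total_studies = sum(row[2] for row in periods)
print(f"overall: {total_inadequate}/{total_studies} "
      f"= {100 * total_inadequate / total_studies:.1f}%")
```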

The current study included the assessment
of the three most frequently read journals in
the plastic surgery field, the reviewed studies
representing highly valued, peer-reviewed literature.
To our knowledge, this study is unique
in its evaluation of statistical power in negative 
studies in the plastic surgery literature alone. 
In this study, we found not only that many 
studies had less than 80 percent power, but 
also that a large proportion had power that was 
far below adequate. Our study emphasizes the 




importance of a careful interpretation of a negative finding so as not
to discard a new intervention that might be beneficial. A good
understanding of statistical power is important when interpreting and
using literature findings to formulate treatment decisions. It also
highlights the importance of a careful study design that would require
a sample size large enough to obtain a reliable conclusion from a
clinical study.

Kevin C. Chung
Section of Plastic Surgery
University of Michigan Medical Center
2130 Taubman Center
1500 E. Medical Center Drive
Ann Arbor, Mich. 48109-0340
kecchung@umich.edu

REFERENCES

1. Kraemer, H. C., and Thiemann, S. How Many Subjects? Statistical
   Power Analysis in Research. Newbury Park, Calif.: Sage
   Publications, 1987. P. 24.

2. Cohen, J. Statistical Power Analysis for the Behavioral Sciences,
   2nd Ed. New York: Academic Press, 1977. P. 5.

3. Remington, R. D., and Schork, M. A. Statistics with Applications to
   the Biological and Health Sciences, 2nd Ed. Englewood Cliffs, N.J.:
   Prentice-Hall, 1985. Pp. 168, 175-178.

4. Freiman, J. A., Chalmers, T. C., Smith, H., Jr., and Kuebler, R. R.
   The importance of beta, the type II error and sample size in the
   design and interpretation of the randomized control trial: Survey
   of 71 "negative" trials. N. Engl. J. Med. 299: 690, 1978.

5. Chung, K. C., Kalliainen, L. K., and Hayward, R. A. Type II (beta)
   errors in the hand literature: The importance of power. J. Hand
   Surg. (Am.) 23: 20, 1998.

6. Reed, J. F., III, and Slaichert, W. Statistical proof in
   inconclusive "negative" trials. Arch. Intern. Med. 141: 1307, 1981.

7. Forrow, L., Taylor, W. C., and Arnold, R. M. Absolutely relative:
   How research results are summarized can affect treatment decisions.
   Am. J. Med. 92: 121, 1992.

8. Kuzon, W. M., Jr., Urbanchek, M. G., and McCabe, S. The seven
   deadly sins of statistical analysis. Ann. Plast. Surg. 37: 265,
   1996.

9. Altman, D. G. The scandal of poor medical research (Editorial). Br.
   Med. J. 308: 283, 1994.

10. Pagano, M., and Gauvreau, K. Principles of Biostatistics, 2nd Ed.
    Belmont, Calif.: Duxbury Press, 1993. Pp. 293-306.

11. Fleiss, J. L. Statistical Methods for Rates and Proportions, 2nd
    Ed. New York: Wiley, 1981. Pp. 33-49.

12. Clegg, F. Introduction to statistics III: Correlation, chi-square
    and the choice of statistical procedure. Br. J. Hosp. Med. 40:
    396, 1988.

13. Elenbaas, R. M., Elenbaas, J. K., and Cuddy, P. G. Evaluating the
    medical literature: Part II. Statistical analysis. Ann. Emerg.
    Med. 12: 610, 1983.

14. Hennekens, C. H., and Buring, J. E. Epidemiology in Medicine.
    Boston: Little, Brown, 1987. Pp. 182-183, 259, 264.

15. Helberg, C. Pitfalls of data analysis (or how to avoid lies and
    damned lies), 1995. Available at: hTtp://w\\^v.
    rdsu.wjsc.edu/piLlalIs. Accessed October 6, 1997.

16. Lindgren, B. R., Wielinski, C. L., Finkelstein, S. M., and
    Warwick, W. J. Contrasting clinical and statistical significance
    within the research setting. Pediatr. Pulmonol. 16: 336, 1993.

17. Schor, S. Statistical proof in inconclusive "negative" trials
    (Editorial). Arch. Intern. Med. 141: 1263, 1981.






Discussion 



The Prevalence of Negative Studies with Inadequate Statistical
Power: An Analysis of the Plastic Surgery Literature

by Kevin C. Chung, M.D., M.S., Loree K. Kalliainen, M.D., Sandra V. Spilson, M.P.H.,
Madonna R. Walters, M.S., R.N., and Hyungjin Myra Kim, Sc.D.



Discussion by Virginia A. Clark, Ph.D. 

The authors' central message is that proper study design requires
taking an adequate sample size in the treatment groups and that this
may not have been done in some articles published in plastic surgery
journals. They analyzed 111 articles that met the following entry
criteria: (1) the articles had to include a statement of no difference
between the control and experimental treatment groups, either for the
key response variable or for the first one listed; (2) there could be
only two groups; and (3) the response variable had to be either
dichotomous (yes, no) or continuous wherein two means were being
compared using a two-group t test. Overall, 20 percent of the articles
had humans as subjects and 80 percent had animals. Using power
calculations computed from the sample results given in these articles,
they concluded that a high proportion of these negative studies had
inadequate power to detect a moderate or large difference between the
two groups.

I do not think that any researcher or statistician
would argue against taking an adequate
sample size so that a study can reach conclusive 
results. Having said that an adequate sample 
size should be taken, it is also fair to say that the 
devil is in the details. 

First, the statistical tests that the authors chose to examine are
ones for which the formulas to determine the sample sizes are not
difficult to find. For more complex analyses of variance, frequency
tables with more than two rows or columns, nonparametric tests, or
other less used tests, finding appropriate formulas can be difficult
and the information needed to use the formulas increases dramatically.
Reference numbers 1, 2, and 11 of the article are useful in this
regard, as is a statistical program called nQuery that was written to
provide either sample size or power for a wide variety of tests.

Received for publication April 4, 2001.

Second, the estimation of sample size for, say, a t test assumes that
one knows the true population standard deviation (SD) and the
difference in the population means that one wants to detect. The
sample size n is proportional to the SD squared divided by the square
of the difference. The result is that small mistakes in estimating
these quantities will result in sizable changes in the estimate of n
needed.
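
The sensitivity described here follows from the standard approximation for a two-group comparison of means, in which the required n per group grows with the square of the SD and shrinks with the square of the difference to be detected. The sketch below is an illustration only; the function name `relative_sample_size` and the 20 percent misestimates are hypothetical examples, not figures from the article.

```python
def relative_sample_size(sd_ratio=1.0, difference_ratio=1.0):
    """Factor by which the required n changes when the true SD and the
    true difference differ from the planning values.

    Relies only on n being proportional to (SD / difference) ** 2.
    """
    return (sd_ratio / difference_ratio) ** 2


# True SD 20 percent larger than the value assumed at the design stage.
print(relative_sample_size(sd_ratio=1.2))
# True difference 20 percent smaller than the value assumed.
print(relative_sample_size(difference_ratio=0.8))
```

A 20 percent error in either planning value therefore changes the required sample size by roughly half, which is why careful pilot estimates matter.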

Third, many studies have small sample sizes 
because of limited availability of space or ability 
to care for animals or patients with unusual 
conditions and other factors. Steps that re- 
searchers can take to ameliorate the problems 
of small sample sizes are as follows:

1. Either make the major response variables continuous measurements or
   devise scales so they are not strictly yes/no or unordered
   outcomes. Note that in the Discussion section of the authors'
   article where they discussed dichotomous data, the needed sample
   sizes ranged from 134 to 176, but for continuous data only 17
   observations were needed in each group to detect a difference of 1
   SD. Ordered responses can be analyzed using nonparametric methods,
   which also tend to be efficient.

2. Make the measurements as accurate as possible. Measurement error
   contributes to the SD. Consider taking duplicate or triplicate
   measurements.

3. Eighty percent of the negative articles reviewed were animal
   studies. There are practical limitations on how many animals can be
   used, but sometimes in these studies investigators seem to choose
   too many experimental groups. The result is small sample sizes in
   each group. Consideration should be given to focusing on what
   groups are really needed. No single study can answer all questions.

4. For studies on patients with unusual conditions, arrange ahead of
   time with colleagues at other institutions to gather a minimal set
   of data in the same fashion so that results from several
   institutions can be analyzed as either combined or separate but
   comparable studies.

The title of the article includes the words "the prevalence of
negative studies." For Plastic and Reconstructive Surgery, 60 negative
articles were found out of an estimated 10,000 articles (20 years were
reviewed, and the journal publishes about 500 articles per year), for
a prevalence of less than 1 percent.

On a positive note, I think the Journal should continue publishing
some articles that have negative results for key variables or first
variable listed. Many of these articles have interesting results in
other variables or other valuable findings. Also, the authors'
suggestion that researchers make post hoc power calculations seems to
me to be a waste of time. No post hoc calculation of power can change
anything, and researchers would be better off analyzing the
information from their negative study to better design the next
study.¹ However, authors should be careful when they word the results
of a negative study to conclude that they were unable to find a
significant difference. In their Discussion section, investigators
should mention the possibility that, if a larger sample were taken,
significant results could be found.

Virginia A. Clark, Ph.D.
Professor Emeritus, Biostatistics and Biomathematics
852 Sparseen Road
Sequim, Wash. 98382
clark@olympus.net

REFERENCE

1. Hoenig, J. M., and Heisey, D. M. The abuse of power: The pervasive
   fallacy of power calculations for data analysis. The American
   Statistician 55: 19, 2001.






